#Libraries
library(knitr)
library(dplyr)
library(plotly)
library(ggcorrplot)
library(ggmosaic)
library(scales)
library(fastDummies)
library(caret)
colors <- c('#0d0887', '#46039f', '#7201a8', '#9c179e', '#bd3786', '#d8576b', '#ed7953', '#fb9f3a', '#fdca26', '#f0f921')

Introduction

Research Scenario

In 1963, the Equal Pay Act was passed, requiring that men and women receive equal pay for equal work in all workplaces. Despite this federal law against gender inequality and discrimination, unjust practices continued in the workplace. Some of the most prominent forms of workplace gender discrimination involve compensation and promotion. While some progress has been made, workplace gender discrimination and inequality are complex, systemic issues that continue to persist.

This analysis utilizes data from a United States District Court case in Houston that arose under Title VII of the Civil Rights Act of 1964. The plaintiffs in the case were female doctors at Houston College of Medicine, who claimed that the university was engaging in patterns and practices of discrimination against women, most notably in pay and promotion.

The purpose of this analysis is to investigate the dataset used in the gender discrimination case against Houston College of Medicine to analyze and conduct inferences about gender pay and promotion inequalities.

Research Question

Specifically, this analysis aims to answer the following research questions:

  • Is the population mean salary greater for male university doctors than female university doctors? Does the inference change when controlling for other factors?

  • Is there a regression equation to predict the salary of a university doctor based on the variables available? Which factors are most important in predicting salary and is gender a significant predictor?

  • Are female doctors less likely to be promoted? That is, is the underlying population proportion of full professors among male university doctors greater than the population proportion of full professors among female university doctors?

Preparation

The Dataset

The dataset used for this analysis was presented by the plaintiffs in the gender discrimination lawsuit and contains information about the background, experience, and salary of 261 doctors.

Source: https://www.kaggle.com/hjmjerry/gender-discrimination?select=Lawsuit.csv

The table below displays the first few rows of the dataset:

data <- read.csv("data.csv")
colnames(data) <- c("ID", "Dept", "Gender", "Emphasis", "Certified", "Publication.Rate", "Experience", "Rank", "Salary.Year1", "Salary.Year2")
kable(head(data))
ID Dept Gender Emphasis Certified Publication.Rate Experience Rank Salary.Year1 Salary.Year2
1 Biology Male Research Not_Certified 7.4 9 Full_Professor 77836 84612
2 Biology Male Research Not_Certified 6.7 10 Associate 69994 78497
3 Biology Male Research Not_Certified 8.1 6 Assistant 62872 67756
4 Biology Male Clinical Board_Certified 5.1 27 Full_Professor 155196 173220
5 Biology Male Research Not_Certified 7.0 10 Full_Professor 89268 96099
6 Biology Male Research Board_Certified 7.7 10 Full_Professor 79714 87531
male <- data[which(data$Gender == "Male"),]
female <- data[which(data$Gender == "Female"),]

Variables

The list below describes the variables of interest included in the dataset:

  • ID: Unique identifier of the doctor
  • Dept: The doctor’s department - Biology, Physiology, Genetics, Pediatrics, Medicine, or Surgery
  • Gender: The doctor’s gender - Male or Female (note that there were no non-binary identifying doctors present in the dataset)
  • Emphasis: Whether the doctor specializes in primarily a clinical emphasis or primarily a research emphasis
  • Certified: Whether the doctor is or is not board certified
  • Publication Rate: The number of publications on CV divided by the number of years between CV date and MD date
  • Experience: The number of years since the doctor has obtained their MD
  • Rank: The level of the doctor at the university: Assistant, Associate, or Full Professor
  • Salary Year 1: The salary of the doctor in the first academic year measured
  • Salary Year 2: The salary of the doctor in the second academic year measured

Data Cleaning

The first part of the data preparation process is to clean the data. The steps below outline the data cleaning process to ensure the data are sound when used for analysis.

Step 1: Remove Duplicate or Irrelevant Observations

  • Each observation in the dataset is a unique and independent observation
  • All variables and observations are initially relevant to the analysis
# TRUE if every ID appears exactly once
ids.unique <- length(unique(data$ID)) == nrow(data)

Step 2: Fix Structural Errors

  • There are no structural errors present in the dataset
#Decided to not factor categorical data

Step 3: Filter Unwanted Outliers

  • While there are outliers present in the dataset, there is no evidence to suggest that the outliers do not belong to their respective distribution; thus, no outliers will be removed
salary <- data$Salary.Year1
stats <- summary(salary)
iqr <- stats[5] - stats[2]
lower <- stats[2] - (1.5 * iqr)
upper <- stats[5] + (1.5 * iqr)
outliers <- salary[which(salary > upper | salary < lower)]

Step 4: Handle Missing Data

  • There are no missing data entries in the dataset
na.rows <- data[rowSums(is.na(data)) > 0,]

Data Transformation

For the majority of the analysis, “Salary Year 1” will serve as the dependent (response) variable and thus the focus of the analysis.

The histogram below examines the distribution of salary. Notice that the variable has a significant right skew. This attribute is problematic because when the distribution does not follow a normal curve, we cannot make certain assumptions required for various analyses.

plot_ly(data, x = ~Salary.Year1, type = "histogram", marker = list(color = colors[1]), opacity = .7) %>%
  layout(title = "Histogram of Salary", xaxis = list(title = "Salary ($)"), yaxis = list(title = "Frequency"))

However, this problem can be addressed with a logarithmic transformation of salary to make the data approximately normal. The histogram below demonstrates the log transformation of salary. Notice that the variable now follows an approximately normal distribution; thus, the statistical analyses can be performed on the log transformation of salary.

data$Log.Salary <- log(data$Salary.Year1)

plot_ly(data, x = ~Log.Salary, type = "histogram", marker = list(color = colors[2]), opacity = .7) %>%
  layout(title = "Histogram of Log Transformation of Salary", xaxis = list(title = "Log Salary ($)"), yaxis = list(title = "Frequency"))

Data Exploration

Before we begin our analysis, let’s explore the variables included in the dataset to gain a better understanding of the research scenario.

Gender

The pie chart below visualizes the percentage of males and females included in the dataset. There is a sufficient number of observations for each gender to conduct an analysis.

plot_ly(labels = c("Male", "Female"), values = c(nrow(male), nrow(female)), type = "pie", marker = list(colors = colors[2:1], line = list(color = "white", width = 2)), opacity = .6, textinfo = "label+percent", showlegend  = FALSE) %>%
  layout(title = "Percentage of Observations by Gender")

Salary

The following density graph illustrates the distribution of salary (without the logarithmic transformation) by gender. Notice that the female distribution is much narrower than the male distribution. Additionally, notice that both the male and female distributions have a right-skew. Recall that the right-skew is reflected in the general distribution of salary.

density.f <- density(female$Salary.Year1)
density.m <- density(male$Salary.Year1)

plot_ly(x = ~density.f$x, y = ~density.f$y, type = "scatter", mode = "lines", name = "Female", fill = "tozeroy", fillcolor="#0d0887CC", line=list( color="#0d0887CC")) %>% 
  add_trace(x = ~density.m$x, y = ~density.m$y, name = "Male", fill = "tozeroy", fillcolor="#46039fCC", line=list( color="#46039fCC")) %>% 
  layout(title = "Density of Salary by Gender", xaxis = list(title = "Salary ($)"), yaxis = list(title = "Density"))

Department

The bar chart below examines the median salary for each department. Notice that departments such as Surgery have a higher median pay than departments such as Biology and Physiology. This is an important factor to consider during the analysis as some departments tend to pay higher than other departments.

dept <- aggregate(data$Salary.Year1, list(data$Dept), FUN=median)
colnames(dept) = c("Dept", "Median.Salary")
dept$Dept <- factor(dept$Dept, levels = dept$Dept[order(dept$Median.Salary)])

plot_ly(dept, y = ~Dept, x = ~Median.Salary, color = ~Dept, colors = colors[10:1], type = "bar", orientation = "h", opacity = .7) %>%
  layout(title = "Median Salary by Department", xaxis = list(title = "Median Salary ($)"), yaxis = list(title = "Department"))

Mosaic Plots

The following mosaic plots examine the frequency of male and female doctors in each category of the variables Rank, Certification, and Emphasis. In the mosaic plots for Certification and Emphasis, there do not appear to be any significant disparities between the percentages of male and female doctors who are board certified versus not certified, or who have a clinical versus research emphasis. However, in the mosaic plot for Professorship Level, notice the low percentage of assistant professors among male doctors compared to female doctors and the high percentage of full professors among male doctors compared to female doctors.

plt <- ggplot(data = data) +
  geom_mosaic(aes(x = product(Gender), fill = Rank)) + 
  xlab("Gender") +
  ylab("Rank") +
  ggtitle("Mosaic Plot of Professorship Level by Gender") + 
     scale_fill_manual(values=colors[3:1])
ggplotly(plt)
plt <- ggplot(data = data) +
  geom_mosaic(aes(x = product(Gender), fill = Certified)) + 
  xlab("Gender") +
  ylab("Certified") +
  ggtitle("Mosaic Plot of Certification by Gender") + 
     scale_fill_manual(values=colors[5:4])
ggplotly(plt)
plt <- ggplot(data = data) +
  geom_mosaic(aes(x = product(Gender), fill = Emphasis)) + 
  xlab("Gender") +
  ylab("Emphasis") +
  ggtitle("Mosaic Plot of Emphasis by Gender") + 
     scale_fill_manual(values=colors[7:6])
ggplotly(plt)

Boxplots

The boxplots below examine the distribution of salary for the variables Rank, Certification, and Emphasis, grouped by gender. Notice that the full professor, board certified, and clinical emphasis groups have higher median salaries than their respective counterparts. However, there are distinct distributional differences between females and males across the three variables.

plot_ly(data, x = ~Rank, y = ~Salary.Year1, color = ~Gender, colors = colors[1:2], type = "box", showlegend = TRUE) %>% 
  layout(boxmode = "group", title = "Distribution of Salary for Professorship Level by Gender", yaxis = list(title = "Salary ($)"))
plot_ly(data, x = ~Certified, y = ~Salary.Year1, color = ~Gender, colors = colors[3:4], type = "box", showlegend = TRUE) %>% 
  layout(boxmode = "group", title = "Distribution of Salary for Certification by Gender", yaxis = list(title = "Salary ($)"))
plot_ly(data, x = ~Emphasis, y = ~Salary.Year1, color = ~Gender, colors = colors[5:6], type = "box", showlegend = TRUE) %>% 
  layout(boxmode = "group", title = "Distribution of Salary for Emphasis by Gender", yaxis = list(title = "Salary ($)"))

Scatter Plots

The scatter plots below visualize the linear association of salary with publication rate and with experience. Notice the moderately strong negative linear relationship between salary and publication rate (r = -0.73), meaning that as the publication rate increases, salary tends to decrease. In contrast, notice the weak-to-moderate positive linear relationship between salary and experience (r = 0.30), meaning that as the number of years of experience increases, salary tends to increase slightly. The correlation coefficient indicates that experience alone is not a strong predictor of salary.

r <- cor(data$Publication.Rate, data$Salary.Year1)
plot_ly(data, x = ~Publication.Rate, y = ~Salary.Year1, type = "scatter", mode = "markers", color = ~Salary.Year1, colors = colors, size = ~Salary.Year1) %>% 
  layout(title = "Scatter Plot of Publication Rate vs. Salary", xaxis = list(title = "Publication Rate"), yaxis = list(title = "Salary ($)")) %>%
hide_colorbar()
r <- cor(data$Experience, data$Salary.Year1)
plot_ly(data, x = ~Experience, y = ~Salary.Year1, type = "scatter", mode = "markers", color = ~Salary.Year1, colors = colors, size = ~Salary.Year1) %>% 
  layout(title = "Scatter Plot of Experience vs. Salary", xaxis = list(title = "Experience (Years)"), yaxis = list(title = "Salary ($)")) %>%
hide_colorbar()

Analysis

Two-Sample Test of Means

We are interested in whether or not university doctors that are male have a greater salary than university doctors that are female. To conduct this assessment, we will perform a two-sample test of means to make an inference about the difference in population means of the two groups.

Assumptions

The following conditions must be met in order to make the necessary inferences from the two-sample t test:

1. Samples must be independent and randomly selected from the two distinct populations of interest
  • We can assume that the observations collected in the dataset are independent; that is, each observation is unique and observations do not influence one another.
2. The variable of interest must be measured in the same way in each of the populations
  • The variable of interest, salary, is measured in the same way for male and female university doctors.
3. The variable of interest should be normally distributed in each population, or the two distributions should at least have similar shapes without extreme outliers
  • While the distribution of salary is skewed right, this is the case for both groups and the shapes are similar; thus, the assumption is upheld
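As an informal check of assumption 3, the shape of each group’s distribution can be compared with a sample skewness statistic. This is a sketch; the `skewness` helper below is not part of the original analysis, and `male`/`female` refer to the gender subsets created during data preparation.

```r
# Sample skewness: positive values indicate a right skew.
# (Hypothetical helper, not from the original analysis.)
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}
```

Comparing `skewness(male$Salary.Year1)` with `skewness(female$Salary.Year1)` indicates whether the two groups share a similar right-skewed shape.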

Hypothesis Test

The following steps conduct a hypothesis test to make an inference about the difference in population means of the salary of male university doctors and female university doctors.

1. Set up the hypotheses and select the alpha level

Let \(\mu_1\) = the population mean of male university doctors’ salaries and \(\mu_2\) = the population mean of female university doctors’ salaries

  • \(H_0: \mu_1 = \mu_2\) (the mean salary is the same between males and females)
  • \(H_1: \mu_1 > \mu_2\) (the mean salary is greater for males versus females)
  • \(\alpha = 0.05\)
alpha <- 0.05
2. Select the appropriate test-statistic and degrees of freedom
  • \(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}\)
  • \(df = min(n_1 -1, n_2 -1)\)
df <- min(nrow(male)-1, nrow(female)-1)
3. State the decision rule

Determine the appropriate critical value from the standard t-distribution table associated with a right hand tail probability of \(\alpha = 0.05\) and \(df = 105\). The appropriate critical value is 1.659.

  • Decision Rule: Reject \(H_0\) if \(t \ge 1.659\)
  • Otherwise, do not reject \(H_0\)
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
4. Compute the test statistic and the associated p-value
  • \(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}} \approx 6.646\)
  • \(p\text{-value} = 8.9 \times 10^{-11}\)
t <- (mean(male$Salary.Year1) - mean(female$Salary.Year1)) / (sqrt((var(male$Salary.Year1)/length(male$Salary.Year1)) + (var(female$Salary.Year1)/length(female$Salary.Year1))))

t.test <- t.test(male$Salary.Year1, female$Salary.Year1, alternative = "greater", conf.level=.95)
5. Conclusion

Reject \(H_0\) since \(6.646 > 1.659\). There is significant evidence at the \(\alpha = 0.05\) level that the population mean salary is greater for male university doctors than female university doctors.

Confidence Interval

The confidence interval for the difference in population means is given by the following formula:

  • \(\bar{x}_1 - \bar{x}_2 \pm t * {\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}\)
  • (43868.76, 73066.22)

We are 95% confident that the true mean difference in the salary between male university doctors and female university doctors is between $43,868.76 and $73,066.22.

dif <- mean(male$Salary.Year1) - mean(female$Salary.Year1)
se <- sqrt((var(male$Salary.Year1)/length(male$Salary.Year1)) + (var(female$Salary.Year1)/length(female$Salary.Year1)))

lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)

Categories

From the two-sample test of means above, we concluded that the population mean salary is greater for male university doctors than for female university doctors. However, this conclusion carries a degree of bias, as we were not accounting for the doctor’s professorship level (rank), emphasis, or certification. As we discovered in the data exploration section, professorship level (Assistant, Associate, or Full Professor), certification (Board Certified or Not Certified), and emphasis (Clinical or Research) all influence the doctor’s salary. Note that while department also influences the doctor’s salary, there are not enough observations to filter on each department.

We are interested in whether or not university doctors that are male have a greater salary than university doctors that are female when controlling for professorship level, certification, and emphasis. To conduct this assessment, we will perform a two-sample test of means to make an inference about the difference in population means of the two groups (male - female) for each of the 12 categories.

The following table summarizes the results from the hypothesis test for each category. Note that categories with “NA” did not have a sufficient number of observations to complete the test.

# Assistant

#Assistant, Certified, Clinical
m1 <- male[male$Rank == "Assistant" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Clinical",]
f1 <- female[female$Rank == "Assistant" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Clinical",]
t1 <- t.test(m1$Salary.Year1, f1$Salary.Year1, alternative = "greater", conf.level=.95)

#Assistant, Certified, Research
m2 <- male[male$Rank == "Assistant" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Research",]
f2 <- female[female$Rank == "Assistant" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Research",]
#t2 <- t.test(m2$Salary.Year1, f2$Salary.Year1, alternative = "greater", conf.level=.95)
t2 <- NA  # too few observations to run the test

#Assistant, Not Certified, Clinical
m3 <- male[male$Rank == "Assistant" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Clinical",]
f3 <- female[female$Rank == "Assistant" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Clinical",]
t3 <- t.test(m3$Salary.Year1, f3$Salary.Year1, alternative = "greater", conf.level=.95)

#Assistant, Not Certified, Research
m4 <- male[male$Rank == "Assistant" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Research",]
f4 <- female[female$Rank == "Assistant" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Research",]
t4 <- t.test(m4$Salary.Year1, f4$Salary.Year1, alternative = "greater", conf.level=.95)


# Associate

#Associate, Certified, Clinical
m5 <- male[male$Rank == "Associate" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Clinical",]
f5 <- female[female$Rank == "Associate" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Clinical",]
t5 <- t.test(m5$Salary.Year1, f5$Salary.Year1, alternative = "greater", conf.level=.95)

#Associate, Certified, Research
m6 <- male[male$Rank == "Associate" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Research",]
f6 <- female[female$Rank == "Associate" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Research",]
t6 <- t.test(m6$Salary.Year1, f6$Salary.Year1, alternative = "greater", conf.level=.95)

#Associate, Not Certified, Clinical
m7 <- male[male$Rank == "Associate" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Clinical",]
f7 <- female[female$Rank == "Associate" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Clinical",]
#t7 <- t.test(m7$Salary.Year1, f7$Salary.Year1, alternative = "greater", conf.level=.95)
t7 <- NA  # too few observations to run the test

#Associate, Not Certified, Research
m8 <- male[male$Rank == "Associate" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Research",]
f8 <- female[female$Rank == "Associate" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Research",]
t8 <- t.test(m8$Salary.Year1, f8$Salary.Year1, alternative = "greater", conf.level=.95)


# Full Professor

#Full Professor, Certified, Clinical
m9 <- male[male$Rank == "Full_Professor" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Clinical",]
f9 <- female[female$Rank == "Full_Professor" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Clinical",]
t9 <- t.test(m9$Salary.Year1, f9$Salary.Year1, alternative = "greater", conf.level=.95)

#Full Professor, Certified, Research
m10 <- male[male$Rank == "Full_Professor" &
           male$Certified == "Board_Certified" &
           male$Emphasis == "Research",]
f10 <- female[female$Rank == "Full_Professor" &
           female$Certified == "Board_Certified" &
           female$Emphasis == "Research",]
t10 <- t.test(m10$Salary.Year1, f10$Salary.Year1, alternative = "greater", conf.level=.95)

#Full Professor, Not Certified, Clinical
m11 <- male[male$Rank == "Full_Professor" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Clinical",]
f11 <- female[female$Rank == "Full_Professor" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Clinical",]
t11 <- t.test(m11$Salary.Year1, f11$Salary.Year1, alternative = "greater", conf.level=.95)

#Full Professor, Not Certified, Research
m12 <- male[male$Rank == "Full_Professor" &
           male$Certified == "Not_Certified" &
           male$Emphasis == "Research",]
f12 <- female[female$Rank == "Full_Professor" &
           female$Certified == "Not_Certified" &
           female$Emphasis == "Research",]
t12 <- t.test(m12$Salary.Year1, f12$Salary.Year1, alternative = "greater", conf.level=.95)

categories <- c("Assistant, Certified, Clinical", "Assistant, Certified, Research","Assistant, Not Certified, Clinical", "Assistant, Not Certified, Research", "Associate, Certified, Clinical", "Associate, Certified, Research","Associate, Not Certified, Clinical", "Associate, Not Certified, Research", "Full Professor, Certified, Clinical", "Full Professor, Certified, Research","Full Professor, Not Certified, Clinical", "Full Professor, Not Certified, Research")
results <- data.frame(Category = categories,
                      t.Stat = c(t1$statistic, t2, t3$statistic, t4$statistic, t5$statistic, t6$statistic, t7, t8$statistic, t9$statistic, t10$statistic, t11$statistic, t12$statistic), 
                      p.Value = c(t1$p.value, t2, t3$p.value, t4$p.value, t5$p.value, t6$p.value, t7, t8$p.value, t9$p.value, t10$p.value, t11$p.value, t12$p.value))

results$t.Stat <- round(as.numeric(results$t.Stat), digits = 3)
results$p.Value <- round(as.numeric(results$p.Value), digits = 5)
results$Decision <- ifelse(results$p.Value < alpha, "Reject H0", "Fail to Reject H0")

kable(head(results,12))
Category t.Stat p.Value Decision
Assistant, Certified, Clinical 2.974 0.00213 Reject H0
Assistant, Certified, Research NA NA NA
Assistant, Not Certified, Clinical 1.745 0.05542 Fail to Reject H0
Assistant, Not Certified, Research 1.023 0.20668 Fail to Reject H0
Associate, Certified, Clinical 2.599 0.00835 Reject H0
Associate, Certified, Research 0.872 0.20311 Fail to Reject H0
Associate, Not Certified, Clinical NA NA NA
Associate, Not Certified, Research 0.584 0.28708 Fail to Reject H0
Full Professor, Certified, Clinical 1.226 0.12253 Fail to Reject H0
Full Professor, Certified, Research -0.854 0.73049 Fail to Reject H0
Full Professor, Not Certified, Clinical 1.278 0.20249 Fail to Reject H0
Full Professor, Not Certified, Research -0.293 0.60895 Fail to Reject H0

Notice that only two categories - 1) Assistant, Certified, Clinical and 2) Associate, Certified, Clinical - had significant evidence at the \(\alpha = 0.05\) level that the population mean salary is greater for male university doctors than female university doctors. Thus, for the remaining categories, we do not reject the null hypothesis that the population mean salary is the same for male and female university doctors.

For the categories that were statistically significant, the confidence interval for the difference in population means is provided below:

Assistant, Certified, Clinical

We are 95% confident that the true mean difference in salary between male and female university doctors who are assistant professors, board certified, and have a clinical emphasis is between $17,941.25 and $65,911.96.

Associate, Certified, Clinical

We are 95% confident that the true mean difference in salary between male and female university doctors who are associate professors, board certified, and have a clinical emphasis is between $19,102.00 and $107,161.10.

# Assistant, Certified, Clinical
df <- min(nrow(m1), nrow(f1))
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
dif <- mean(m1$Salary.Year1) - mean(f1$Salary.Year1)
se <- sqrt((var(m1$Salary.Year1)/length(m1$Salary.Year1)) + (var(f1$Salary.Year1)/length(f1$Salary.Year1)))
lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)

# Associate, Certified, Clinical
df <- min(nrow(m5), nrow(f5))
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
dif <- mean(m5$Salary.Year1) - mean(f5$Salary.Year1)
se <- sqrt((var(m5$Salary.Year1)/length(m5$Salary.Year1)) + (var(f5$Salary.Year1)/length(f5$Salary.Year1)))
lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)
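As an aside, the twelve subset-and-test steps above can be written more compactly with a helper function. This is a sketch: the name `test.category` is not from the original analysis, and `male`/`female` refer to the gender subsets created during data preparation.

```r
# Run the one-sided two-sample t-test (male > female) for a single
# Rank x Certification x Emphasis category; returns NA when either
# group has too few observations for a t-test.
test.category <- function(rank, cert, emph) {
  m <- male$Salary.Year1[male$Rank == rank & male$Certified == cert & male$Emphasis == emph]
  f <- female$Salary.Year1[female$Rank == rank & female$Certified == cert & female$Emphasis == emph]
  if (length(m) < 2 || length(f) < 2) return(c(t = NA, p = NA))
  res <- t.test(m, f, alternative = "greater", conf.level = .95)
  c(t = unname(res$statistic), p = res$p.value)
}

# All twelve category combinations in a single data frame
combos <- expand.grid(Rank = c("Assistant", "Associate", "Full_Professor"),
                      Certified = c("Board_Certified", "Not_Certified"),
                      Emphasis = c("Clinical", "Research"),
                      stringsAsFactors = FALSE)
```

`t(mapply(test.category, combos$Rank, combos$Certified, combos$Emphasis))` then yields the same t statistics and p-values summarized in the table above.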

Multiple Linear Regression

We are interested in modeling a multiple linear regression equation to understand the linear relationship between the response variable, salary, and the explanatory variables included in the dataset. Note: A logarithmic transformation was applied to the response variable, salary, to improve the normality of the variable.

Train & Test Split

Before we build the multiple linear regression model, we first split the dataset into a training set and a testing set to perform cross-validation on the model. The training set will be used to build the model, and the testing set will validate the model by estimating the prediction error. Note: While only shown once, this process is repeated multiple times in the cross-validation section of the analysis.

reg.data <- dummy_cols(data, select_columns = c("Dept", "Gender", "Emphasis", "Certified", "Rank"), remove_selected_columns = TRUE)
reg.data <- reg.data[,c(6, 14, 17, 15, 21, 20, 12, 9, 8, 10, 7, 3, 2)]

smp.size <- floor(.8 * nrow(reg.data))
set.seed(1997) # set seed for visualization purposes
train.index <- sample(seq_len(nrow(reg.data)), size = smp.size)
train <- reg.data[train.index, ]
test <- reg.data[-train.index, ]
plot_ly(labels = c("Train", "Test"), values = c(.8, .2), type = "pie", marker = list(colors = colors[1:2], line = list(color = "white", width = 2)), opacity = .6, textinfo = "label+percent", showlegend  = FALSE) %>%
  layout(title = "Train/Test Split")

Model

The least-squares multiple linear regression equation to predict (log) salary is given by the following, where \(\hat{y}\) is the expected or predicted log salary based on the following variables included in the dataset.

Note: Since several of the variables in the dataset are categorical, dummy variables will be utilized in the regression equation.

  • Gender: \(x_1 =\) {1 - if male | 0 - if female}
  • Certified: \(x_2 =\) {1 - if board certified | 0 - if not certified}
  • Emphasis: \(x_3 =\) {1 - if clinical | 0 - if research}
  • Rank:
    • \(x_4 =\) {1 - if full professor | 0 - if not full professor}
    • \(x_5 =\) {1 - if associate | 0 - if not associate}
    • if \(x_4\) and \(x_5\) are 0, then the rank is Assistant
  • Department:
    • \(x_6 =\) {1 - if surgery | 0 - if not surgery}
    • \(x_7 =\) {1 - if medicine | 0 - if not medicine}
    • \(x_8 =\) {1 - if genetics | 0 - if not genetics}
    • \(x_9 =\) {1 - if pediatrics | 0 - if not pediatrics}
    • \(x_{10} =\) {1 - if biology | 0 - if not biology}
    • if \(x_6\) through \(x_{10}\) are 0, then the department is Physiology
  • Experience: \(x_{11}\)
  • Publication Rate: \(x_{12}\)

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1{x_1} + \hat{\beta}_2{x_2} + \hat{\beta}_3{x_3} + \hat{\beta}_4{x_4} + \hat{\beta}_5{x_5} + \hat{\beta}_6{x_6} + \hat{\beta}_7{x_7} + \hat{\beta}_8{x_8} + \hat{\beta}_9{x_9} + \hat{\beta}_{10}{x_{10}} + \hat{\beta}_{11}{x_{11}} + \hat{\beta}_{12}{x_{12}}\]

\[\hat{y} = 10.99 + 0.04{x_1} + 0.18{x_2} + 0.14{x_3} + 0.22{x_4} + 0.14{x_5} + 1.01{x_6} + 0.64{x_7} + 0.32{x_8} + 0.29{x_9} + 0.16{x_{10}} + 0.02{x_{11}} - 0.03{x_{12}}\]
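As an illustration, the fitted equation can be applied by hand. The sketch below uses the rounded coefficients above for a hypothetical doctor (male, board certified, clinical emphasis, full professor in Surgery, 10 years of experience, publication rate of 5); the predicted log salary is then back-transformed to dollars.

```r
# Predicted log salary for a hypothetical doctor, plugging the
# rounded coefficients from the fitted equation above into each term
log.salary.hat <- 10.99 +
  0.04 * 1 +   # x1: male
  0.18 * 1 +   # x2: board certified
  0.14 * 1 +   # x3: clinical emphasis
  0.22 * 1 +   # x4: full professor
  0.14 * 0 +   # x5: not associate
  1.01 * 1 +   # x6: surgery
  0.64 * 0 + 0.32 * 0 + 0.29 * 0 + 0.16 * 0 +  # x7-x10: other departments
  0.02 * 10 +  # x11: 10 years of experience
  -0.03 * 5    # x12: publication rate of 5

salary.hat <- exp(log.salary.hat)  # back-transform from the log scale
```

Because the model predicts log salary, the exponential back-transform is needed to express the prediction in dollars.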

mlr <- lm(Log.Salary~., data = train)

Assumptions

1. Linear relationship

  • Assumption Requirement: There exists a linear relationship between each predictor variable and the response variable
  • For the categorical variables, dummy variables meet the assumption of linearity by definition
  • For the numeric variables, the residual plots below reveal any non-linearity between the explanatory variable and the response
  • For experience, notice that the residual plot below is slightly clustered to the left, indicating only a moderate linear association between experience and log salary
  • For publication rate, notice that the residual plot below is randomly scattered with no distinct patterns, indicating a linear association between publication rate and log salary
plot_ly(data = train, x = ~Experience, y = resid(mlr), type = "scatter", mode = "markers", marker = list(color = colors[1]), opacity = .6) %>% 
  layout(title = "Scatter Plot of Experience vs. Residuals",
         xaxis = list(title = "Experience (years)"),
         yaxis = list(title = "Residuals"))
plot_ly(data = train, x = ~Publication.Rate, y = resid(mlr), type = "scatter", mode = "markers", marker = list(color = colors[2]), opacity = .6) %>% 
  layout(title = "Scatter Plot of Publication Rate vs. Residuals",
         xaxis = list(title = "Publication Rate"),
         yaxis = list(title = "Residuals"))

2. No Multicollinearity

  • Assumption Requirement: None of the predictor variables are highly correlated with each other
  • The correlation matrix below examines the linear relationship among the explanatory variables.
  • For the most part, there is no multicollinearity present among the variables. The only exception is between publication rate and clinical emphasis, with a correlation of -0.85. Since this correlation is high, one of the two variables should be removed from the model.
cor_data <- train[,-1]
colnames(cor_data) <- c("Male", "Board Cert", "Clinical", "Full Prof", "Assoc Prof", "Surgery", "Medicine", "Genetics", "Pediatrics", "Biology", "Experience", "Pub Rate")
corr <- cor(cor_data)
corr.plot <- ggcorrplot(corr, type = "lower", tl.cex = 0.2, colors = c(colors[1], "#7201A840", colors[5])) + labs(title = "Correlation Matrix of Explanatory Variables")
ggplotly(corr.plot)

3. Independence

  • Assumption Requirement: The observations are independent
  • We can assume that the observations collected in the dataset are independent; that is, each observation is unique and observations do not influence one another.

4. Homoscedasticity

  • Assumption Requirement: The residuals have constant variance at every point in the linear model
  • The plot below illustrates the standardized residuals versus the predicted values. Since the standardized residuals are scattered about zero with no clear pattern, the homoscedasticity assumption is satisfied.
predicted <- predict(mlr)
standard_res <- rstandard(mlr)
homosced <- data.frame(predicted, standard_res)
plot_ly(data = homosced, x =~ predicted, y = ~standard_res, type = "scatter", mode = "markers", marker = list(color = colors[3]), opacity = .6) %>%
  layout(title = "Standardized Residuals vs. Predicted Values", xaxis = list(title = "Predicted Values"), yaxis = list(title = "Standardized Residuals"))
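A numeric check can back up the visual one. The sketch below runs a rough Breusch-Pagan-style auxiliary regression (squared residuals on fitted values); it is illustrative only and uses a toy `mtcars` model in place of `mlr`. A large slope p-value is consistent with constant variance.

```r
# Sketch of a Breusch-Pagan-style check: regress squared residuals on the
# fitted values. A small p-value for the slope would suggest the residual
# variance changes with the predicted value (heteroscedasticity).
fit <- lm(mpg ~ wt + hp, data = mtcars)   # toy model standing in for mlr
aux <- lm(resid(fit)^2 ~ fitted(fit))
bp.p <- summary(aux)$coefficients[2, 4]   # p-value for the slope term
```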

5. Multivariate Normality

  • Assumption Requirement: The residuals of the model are normally distributed
  • The Q-Q plot below visualizes whether or not the residuals of a model follow a normal distribution. If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.
  • While there is a slight deviation from the line, the points generally follow a straight diagonal line. Thus, we can conclude that the residuals roughly follow a normal distribution.
plt <- ggplot(data.frame(res = resid(mlr)), aes(sample = res)) + geom_qq(color = "#9C179E99") + geom_qq_line() + labs(title = "Normal Q-Q plot", x = "Theoretical Quantiles", y = "Sample Quantiles")
ggplotly(plt)
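A formal normality test can supplement the Q-Q plot. The sketch below applies a Shapiro-Wilk test to model residuals; it uses a toy `mtcars` fit in place of the document's model, so the model here is an illustrative assumption.

```r
# Sketch: Shapiro-Wilk test on residuals (H0: residuals are normal);
# a p-value above 0.05 is consistent with the normality assumption.
fit <- lm(mpg ~ wt, data = mtcars)   # toy model standing in for mlr
sw <- shapiro.test(resid(fit))
sw$p.value
```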

Assessing the Fit

Adjusted R-Squared

To assess the fit of the multiple linear regression, we will calculate adjusted R-squared, which measures the proportion of variation in the response variable explained by the multiple regression model, adjusted for the number of predictors in the model.

The adjusted R-squared of the multiple linear regression model above is 0.929. This means that 92.9% of the variability in the university doctor’s log salary can be explained by the explanatory variables.

adjr2 <- summary(mlr)$adj.r.squared

Global F-test

The next step to assess the fit of the multiple linear regression is to conduct a global F-test to understand if the explanatory variables are significant predictors of salary. The steps below conduct a hypothesis test for the global F-test.

1. Set up the hypotheses and select the alpha level
  • \(H_0: \beta_1 = \beta_2 = ... = \beta_k = 0\) (none of the explanatory variables is a significant predictor of salary)
  • \(H_1: \beta_i \neq 0\) for at least one \(i\) (at least one explanatory variable is a significant predictor of salary)
  • \(\alpha = 0.05\)
alpha <- 0.05
2. Select the appropriate test-statistic and degrees of freedom
  • \(F = \frac{MS_{Reg}}{MS_{Res}}\)
  • \(df = k, n-k-1\)
k <- ncol(train)-1
df1 <- k
df2 <- nrow(train) - k - 1
3. State the decision rule

Determine the appropriate critical value from the F-distribution table associated with a right hand tail probability of \(\alpha = 0.05\) and \(df = 12, 195\). The appropriate critical value is 1.80.

  • Decision Rule: Reject \(H_0\) if \(F \ge 1.80\)
  • Otherwise, do not reject \(H_0\)
critical.value <- qf(p = 0.95, df1 = df1, df2 = df2)
4. Compute the test statistic and the associated p-value
  • \(F = \frac{MS_{Reg}}{MS_{Res}} \approx 225\)
  • \(p\text{-value} < 2.2 \times 10^{-16}\)
mlr.summary <- summary(mlr)
5. Conclusion

Reject \(H_0\) since \(225 > 1.80\). We have significant evidence at the \(\alpha = 0.05\) level that at least one of the variables is a significant predictor of salary. That is, there is evidence of a significant linear association between the salary of the university doctor and the associated explanatory variables.
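The F statistic and its p-value can be pulled directly from the fitted model rather than read off a table. The helper below is a sketch (`f_test` is a hypothetical name), demonstrated on a toy `mtcars` model since `mlr` is defined earlier in the document.

```r
# Sketch: extract the global F statistic and p-value from an lm summary.
f_test <- function(fit) {
  fs <- summary(fit)$fstatistic          # named vector: value, numdf, dendf
  p  <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
  c(F = unname(fs["value"]), p.value = unname(p))
}

f_test(lm(mpg ~ wt + hp, data = mtcars))   # toy model standing in for mlr
```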

Individual t-Tests

Since the global F-test concluded that at least one of the explanatory variables is a significant predictor of salary, the next step is to perform testing on each individual parameter to understand the contribution of each independent variable. To do so, we will perform individual t-tests, controlling for the other variables.

The following steps conduct a hypothesis test for each explanatory variable to understand the contribution of the variable in predicting salary, controlling for the other explanatory variables.

1. Set up the hypotheses and select the alpha level
  • \(H_0: \beta_i = 0\) (controlling for the other explanatory variables)
  • \(H_1: \beta_i \neq 0\) (controlling for the other explanatory variables)
  • \(\alpha = 0.05\)
alpha <- 0.05
2. Select the appropriate test-statistic and degrees of freedom
  • \(t = \frac{\hat{\beta}_i}{SE_{\hat{\beta_i}}}\)
  • \(df = n-k-1\)
df <- nrow(train)- k -1
3. State the decision rule

Determine the appropriate critical value from the t-distribution table associated with a two-sided probability of \(\alpha = 0.05\) (a right hand tail probability of \(\alpha/2 = 0.025\)) and \(df = 195\). The appropriate critical value is 1.972.

  • Decision Rule: Reject \(H_0\) if \(t \ge 1.972\) or \(t \le -1.972\)
  • Otherwise, do not reject \(H_0\)
critical.value <- qt(p = alpha/2, df = df, lower.tail = FALSE)
4. Compute the test statistic and the associated p-value

The output below shows the results from the individual t-tests for each explanatory variable:

results <- as.data.frame(mlr.summary$coefficients[,3:4])
colnames(results) <- c("t.Stat", "p.Value")
results$decision <- ifelse(results$p.Value < alpha, "Reject H0", "Fail to Reject H0")
kable(results)
t.Stat p.Value decision
(Intercept) 70.357634 0.0000000 Reject H0
Gender_Male 1.715522 0.0878380 Fail to Reject H0
Certified_Board_Certified 7.274367 0.0000000 Reject H0
Emphasis_Clinical 3.031747 0.0027618 Reject H0
Rank_Full_Professor 7.329002 0.0000000 Reject H0
Rank_Associate 4.937636 0.0000017 Reject H0
Dept_Surgery 14.070001 0.0000000 Reject H0
Dept_Medicine 12.100846 0.0000000 Reject H0
Dept_Genetics 7.254096 0.0000000 Reject H0
Dept_Pediatrics 4.732456 0.0000043 Reject H0
Dept_Biology 4.789135 0.0000033 Reject H0
Experience 8.781960 0.0000000 Reject H0
Publication.Rate -1.455615 0.1471071 Fail to Reject H0
5. Conclusion

For the majority of the explanatory variables, the decision is to reject \(H_0\) since the absolute value of the t-statistic is greater than the critical value. This concludes that these explanatory variables are significant predictors of salary after adjusting for the other explanatory variables.

However, notice that for gender (male) and publication rate, the decision is to fail to reject \(H_0\) since the absolute value of the t-statistic is less than the critical value. This concludes that these explanatory variables are not significant predictors of salary after adjusting for the other explanatory variables. As a result, these variables should be considered for removal from the multiple regression model.

Before we develop a revised multiple regression model, note that gender was not a statistically significant predictor of salary in the full model when controlling for the other explanatory variables.

Revised Model

In our revised multiple regression model, we only want to include variables that are statistically significant in predicting log salary. One at a time, we will remove variables that are not significant in the model. The revised multiple regression model to predict salary removed gender and publication rate. After the removal of these two variables, all remaining variables were statistically significant in predicting log salary. The equation below represents the revised least-squares regression line for multiple linear regression to predict log salary based on the following explanatory variables:

  • Certified: \(x_1 =\) {1 - if board certified | 0 - if not certified}
  • Emphasis: \(x_2 =\) {1 - if clinical | 0 - if research}
  • Rank:
    • \(x_3 =\) {1 - if full professor | 0 - if not full professor}
    • \(x_4 =\) {1 - if associate | 0 - if not associate}
    • if \(x_3\) and \(x_4\) are 0, then rank is assistant
  • Department:
    • \(x_5 =\) {1 - if surgery | 0 - if not surgery}
    • \(x_6 =\) {1 - if medicine | 0 - if not medicine}
    • \(x_7 =\) {1 - if genetics | 0 - if not genetics}
    • \(x_8 =\) {1 - if pediatrics | 0 - if not pediatrics}
    • \(x_9 =\) {1 - if biology | 0 - if not biology}
    • if \(x_5\) through \(x_9\) are 0, then the department is physiology
  • Experience: \(x_{10}\)

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1{x_1} + \hat{\beta}_2{x_2} + \hat{\beta}_3{x_3} + \hat{\beta}_4{x_4} + \hat{\beta}_5{x_5} + \hat{\beta}_6{x_6} + \hat{\beta}_7{x_7} + \hat{\beta}_8{x_8} + \hat{\beta}_9{x_9} + \hat{\beta}_{10}{x_{10}}\] \[\hat{y} = 10.75 + 0.20{x_1} + 0.22{x_2} + 0.26{x_3} + 0.15{x_4} + 1.10{x_5} + 0.69{x_6} + 0.35{x_7} + 0.37{x_8} + 0.18{x_9} + 0.02{x_{10}}\]

mlr2 <- lm(Log.Salary ~ .-Gender_Male-Publication.Rate, data = train)
mlr2.summary <- summary(mlr2)
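The one-at-a-time removal described above can be sketched as a small helper. This is an illustrative implementation, not the document's code; `backward_by_p` is a hypothetical name, and the demonstration uses `mtcars` rather than the salary data.

```r
# Sketch: backward elimination by p-value. Repeatedly drop the least
# significant predictor until every remaining predictor has p <= alpha.
backward_by_p <- function(fit, alpha = 0.05) {
  repeat {
    cf <- summary(fit)$coefficients
    pm <- cf[-1, , drop = FALSE]                  # drop the intercept row
    if (nrow(pm) == 0) return(fit)                # nothing left to remove
    p  <- setNames(pm[, 4], rownames(pm))         # p-value per predictor
    if (max(p) <= alpha) return(fit)              # all terms significant
    worst <- names(which.max(p))                  # least significant term
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
}

# Toy usage: start from a full model and prune insignificant terms.
full    <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- backward_by_p(full)
```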

Assessing the Fit

  • Adjusted R-Squared: 94.3% of the variability in the university doctor’s log salary can be explained by the explanatory variables
  • Global F-test: Reject \(H_0\) - there is evidence of a significant linear association between the salary of the university doctor and at least one of the associated explanatory variables
  • Individual t-tests: Reject \(H_0\) - all explanatory variables are significant predictors for salary after adjusting for the other explanatory variables included in the model

Coefficient Interpretations

  • \(\hat{\beta}_0\): $47,830.44 (the back-transformed intercept, \(e^{\hat{\beta}_0}\)) is the estimated average salary when all explanatory variables are equal to 0
  • \(\hat{\beta}_1\): A 19.5% increase in salary is associated with being board certified relative to not being board certified, after controlling for the other explanatory variables
  • \(\hat{\beta}_2\): A 22.5% increase in salary is associated with having a clinical emphasis relative to having a research emphasis, after controlling for the other explanatory variables
  • \(\hat{\beta}_3\): A 26.4% increase in salary is associated with being a full professor relative to being an assistant professor, after controlling for the other explanatory variables
  • \(\hat{\beta}_4\): A 16.9% increase in salary is associated with being an associate professor relative to being an assistant professor, after controlling for the other explanatory variables
  • \(\hat{\beta}_5\): A 203.3% increase in salary is associated with being in the surgery department relative to being in the physiology department, after controlling for the other explanatory variables
  • \(\hat{\beta}_6\): A 102.6% increase in salary is associated with being in the medicine department relative to being in the physiology department, after controlling for the other explanatory variables
  • \(\hat{\beta}_7\): A 41.7% increase in salary is associated with being in the genetics department relative to being in the physiology department, after controlling for the other explanatory variables
  • \(\hat{\beta}_8\): A 42.9% increase in salary is associated with being in the pediatrics department relative to being in the physiology department, after controlling for the other explanatory variables
  • \(\hat{\beta}_9\): An 18.1% increase in salary is associated with being in the biology department relative to being in the physiology department, after controlling for the other explanatory variables
  • \(\hat{\beta}_{10}\): An increase of one year in experience is associated with a 1.8% increase in salary, after controlling for the other explanatory variables
coef <- mlr2$coefficients
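Because the response is log salary, each percent interpretation above comes from back-transforming the coefficient: \(e^{\hat{\beta}} - 1\). A minimal sketch (`pct_change` is a hypothetical helper, not part of the original analysis):

```r
# Sketch: convert a log-scale regression coefficient into the associated
# percent change in salary, as used in the interpretations above.
pct_change <- function(beta) (exp(beta) - 1) * 100

pct_change(0.02)   # ~2% increase per additional year of experience
```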

Cross Validation

Now that we have developed the multiple linear regression model, we can apply the model to the testing set to predict the outcome of new unseen observations. The output below displays the quality estimates from running cross validation 20 times and the average of the results.

The table below displays the average and the first few runs of the cross validation:

R.sq <- c()
RMSE <- c()
MAE <- c()
pred.error.rate <- c()

for(i in 1:20){
  
  # Random 80/20 train/test split for this run
  smp.size <- floor(.8 * nrow(reg.data))
  train.index <- sample(seq_len(nrow(reg.data)), size = smp.size)
  cv.train <- reg.data[train.index, ]
  cv.test <- reg.data[-train.index, ]
  
  model <- lm(Log.Salary ~ .-Gender_Male-Publication.Rate, data = cv.train)
  predictions <- predict(model, cv.test)
  
  R.sq <- c(R.sq, R2(predictions, cv.test$Log.Salary))
  RMSE <- c(RMSE, RMSE(predictions, cv.test$Log.Salary))
  MAE <- c(MAE, MAE(predictions, cv.test$Log.Salary))
  # Prediction error rate for this run: RMSE relative to the mean response
  pred.error.rate <- c(pred.error.rate, RMSE[i] / mean(cv.test$Log.Salary))
  
}

quality.est <- data.frame(seq(1,20), round(R.sq,3), round(RMSE,3), round(MAE,3), round(pred.error.rate,3))
colnames(quality.est) <- c("Run", "R-Squared", "RMSE", "MAE", "Prediction Error Rate")

average <- c("Average", round(mean(R.sq), 3), round(mean(RMSE), 3), round(mean(MAE), 3), round(mean(pred.error.rate), 3))
quality.est <- rbind(average, quality.est) 

kable(head(quality.est))
Run R-Squared RMSE MAE Prediction Error Rate
Average 0.919 0.142 0.105 0.012
1 0.95 0.119 0.087 0.01
2 0.93 0.136 0.11 0.012
3 0.954 0.12 0.095 0.01
4 0.95 0.128 0.1 0.011
5 0.954 0.134 0.107 0.011
mean.R.sq <- round(mean(R.sq) * 100, 1)
mean.RMSE <- round(mean(RMSE), 3)
mean.pred.er <- round(mean(pred.error.rate) * 100, 1)

Overall, the results from the testing sets indicate that the model is suitable for predicting salary. On average, 91.9% of the variability in the university doctor’s log salary can be explained by the explanatory variables. Note that the average from the testing sets is only slightly worse than the training set (adjusted R-squared = 94.3%). On average, the prediction for log salary is off by 0.142 (on the log scale) based on RMSE, resulting in a prediction error rate of 1.2%.
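The repeated random splits above are one option; classic k-fold cross-validation is another. The base-R sketch below is illustrative (`kfold_rmse` is a hypothetical helper, demonstrated on `mtcars` so it runs standalone):

```r
# Sketch: k-fold cross-validation of a linear model, reporting mean RMSE.
kfold_rmse <- function(data, formula, k = 5, seed = 42) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(data)))  # assign rows to folds
  rmse <- numeric(k)
  for (i in 1:k) {
    fit  <- lm(formula, data = data[folds != i, ])    # train on k-1 folds
    pred <- predict(fit, newdata = data[folds == i, ])
    obs  <- model.response(model.frame(formula, data[folds == i, ]))
    rmse[i] <- sqrt(mean((obs - pred)^2))             # held-out fold RMSE
  }
  mean(rmse)
}

kfold_rmse(mtcars, mpg ~ wt + hp, k = 5)   # toy data standing in for reg.data
```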

Two-Sample Tests for Proportions

For the majority of this analysis, the salary disparities between male and female university doctors were the focal point. In this section, we are interested in investigating a pattern of inequality against women in receiving promotions (i.e., promotion to full professor). That is, we are interested in whether the underlying population proportion of full professors among male university doctors is greater than the population proportion of full professors among female university doctors.

The table below summarizes the information required to perform a two-sample test for proportions. Note that “success” is defined as being a full professor.

prop.df <- data.frame(
  Population = c(1, 2),
  Description = c("Male", "Female"),
  Sample.Size = c(nrow(male), nrow(female)),
  Count.Successes = c(nrow(male[male$Rank == "Full_Professor",]), nrow(female[female$Rank == "Full_Professor",])),
  Count.Failures = c(nrow(male[male$Rank != "Full_Professor",]), nrow(female[female$Rank != "Full_Professor",])))
prop.df$Sample.Proportion = round(prop.df$Count.Successes/prop.df$Sample.Size,4)

kable(prop.df)
Population Description Sample.Size Count.Successes Count.Failures Sample.Proportion
1 Male 155 69 86 0.4452
2 Female 106 16 90 0.1509

Hypothesis Test

The following steps conduct a hypothesis test to make an inference about the underlying population proportion of full professors among male university doctors versus female university doctors.

1. Set up the hypotheses and select the alpha level

Let \(p_1\) = the proportion of male university doctors who are full professors and \(p_2\) = the proportion of female university doctors who are full professors

  • \(H_0: p_1 = p_2\) (the underlying population proportion of full professors among male university doctors is the same as the population proportion of full professors among female university doctors)
  • \(H_1: p_1 > p_2\) (the underlying population proportion of full professors among male university doctors is greater than the population proportion of full professors among female university doctors)
  • \(\alpha = 0.05\)
alpha <- 0.05
2. Select the appropriate test-statistic and degrees of freedom
  • \(z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}\)
3. State the decision rule

Determine the appropriate critical value from the standard normal distribution associated with a right hand tail probability of \(\alpha = 0.05\). The appropriate critical value is 1.64.

  • Decision Rule: Reject \(H_0\) if \(z \ge 1.64\)
  • Otherwise, do not reject \(H_0\)
critical.value <- qnorm(1 - alpha)
4. Compute the test statistic and the associated p-value
  • \(z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \approx 4.98\)
  • \(p\text{-value} \approx 3.3 \times 10^{-7}\)
p1 <- prop.df$Sample.Proportion[1]
p2 <- prop.df$Sample.Proportion[2]
s1 <- prop.df$Count.Successes[1]
s2 <- prop.df$Count.Successes[2]
n1 <- prop.df$Sample.Size[1]
n2 <- prop.df$Sample.Size[2]
p <- (s1 + s2) / (n1 + n2)  # pooled sample proportion
z <- (p1 - p2) / sqrt( p*(1-p) * ((1/n1)+(1/n2)) )

prop.result <- prop.test(c(s1, s2), c(n1, n2), conf.level = 1 - alpha, alternative = "greater")
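As a sanity check, `prop.test` run without the continuity correction (`correct = FALSE`) reports an X-squared statistic equal to the square of the pooled two-proportion z-statistic:

```r
# Sketch: cross-check the hand-computed pooled z via prop.test; with
# correct = FALSE, the reported chi-squared statistic equals z^2.
check <- prop.test(c(69, 16), c(155, 106),
                   alternative = "greater", correct = FALSE)
z.check <- sqrt(unname(check$statistic))   # pooled-variance z
```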
5. Conclusion

Reject \(H_0\) since \(4.98 > 1.64\). There is significant evidence at the \(\alpha = 0.05\) level that the proportion of full professors among male university doctors is greater than the proportion among female university doctors. The percentage of full professors was approximately 29 percentage points lower for female doctors (15.1%) than male doctors (44.5%).

Confidence Interval

The confidence interval for the difference in population proportions is given by the following formula:

  • \((\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\)
  • (0.191, 0.398)

We are 95% confident that the percentage of university doctors who are full professors is between 19.1 and 39.8 percentage points lower among female doctors than among male doctors.

dif <- p1 - p2
se <- sqrt( (p1*(1-p1)/n1) + (p2*(1-p2)/n2) )
z.crit <- qnorm(1 - alpha/2)  # two-sided 95% critical value (1.96)

lower <- dif - (z.crit * se)
upper <- dif + (z.crit * se)

Conclusion

The goal of this project was to analyze the dataset used in a workplace gender discrimination case and utilize various statistical techniques to conduct inferences about gender pay and promotion inequalities. Throughout the analysis, we investigated questions to understand the population mean salary differences between male and female university doctors, predict a university doctor’s salary and determine which variables are the most significant predictors, and examine the proportion of full professors that are male versus female.

Overall, there is no disputing that male university doctors are paid significantly more than female university doctors. However, this may or may not hold when controlling for variables such as professorship level, certification, and emphasis. When doing so, there was significant statistical evidence that male university doctors are paid more than female university doctors only for assistant and associate professors who are board certified and have a clinical emphasis. Additionally, from the multiple linear regression we found that gender is not a significant predictor of the university doctor’s salary when including the other variables in the dataset. Other factors such as department and professorship level play a larger role in predicting salary. This does not discount the notable inequality between the proportion of male full professors and female full professors. As a result, if a lower proportion of female university doctors are promoted to full professor, then female university doctors are likely to have lower salaries than male university doctors.